Goto

Collaborating Authors

 separability condition



Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks

arXiv.org Machine Learning

Deep networks often have an enormous amount of parameters an d are theoretically capable of overfitting the training data. However, in practice, deep networks trai ned via gradient descent (GD) or its variants often generalize well. This is commonly attributed to the implicit bias of GD, in which GD finds a certain solution that prevents overfitting ( Zhang et al., 2021; Neyshabur et al., 2017; Bartlett et al., 2021). Understanding the implicit bias of GD is one of the central topics in deep learni ng theory. The implicit bias of GD is relatively well-understood when t he network is homogeneous (see Soudry et al., 2018; Ji and Telgarsky, 2018; Lyu and Li, 2020; Ji and Telgarsky, 2020; Wu et al., 2023, and references therein). For linear networks trained on linearly separabl e data, GD diverges in norm while converging in direction to the maximum margin solution ( Soudry et al., 2018; Ji and Telgarsky, 2018; Wu et al., 2023). Similar results have been established for generic homogene ous networks that include a class of deep networks, assuming that the network at initialization can sepa rate the training data.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

The paper was interesting to read and contained many nice ideas, intuitions, and theoretical connections. The technical contributions in Sections 3-5 are sound and clever, although in some cases building on previous works [2,4]. Moreover, the problem formulation as a linear optimization and the least squared solution make the approach computationally favorable. However, there are points that need more clarification/elaboration in the paper too, in order to better judge the practical application and impact of the work. For instance, it's never discussed where the set of low order marginal come from.


Discrete Rényi Classifiers

Neural Information Processing Systems

When the probability distribution P(X, Y) is known, the optimal classifier, leading to the minimum misclassification rate, is given by the Maximum A-posteriori Probability (MAP) decision rule. However, in practice, estimating the complete joint distribution P(X, Y) is computationally and statistically impossible for large values of d. Therefore, an alternative approach is to first estimate some low order marginals of the joint probability distribution P(X, Y) and then design the classifier based on the estimated low order marginals. This approach is also helpful when the complete training data instances are not available due to privacy concerns. In this work, we consider the problem of finding the optimum classifier based on some estimated low order marginals of (X, Y).


Learning Topic Models: Identifiability and Finite-Sample Analysis

arXiv.org Machine Learning

Topic models provide a useful text-mining tool for learning, extracting and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, a formal theoretical investigation on the statistical identifiability and accuracy of latent topic estimation is lacking in the literature. In this paper, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood, which is naturally connected to the concept of volume minimization in computational geometry. Theoretically, we introduce a new set of geometric conditions for topic model identifiability, which are weaker than conventional separability conditions relying on the existence of anchor words or pure topic documents. We conduct finite-sample error analysis for the proposed estimator and discuss the connection of our results with existing ones. We conclude with empirical studies on both simulated and real datasets.


Recovering Joint Probability of Discrete Random Variables from Pairwise Marginals

arXiv.org Machine Learning

Learning the joint probability of random variables (RVs) lies at the heart of statistical signal processing and machine learning. However, direct nonparametric estimation for high-dimensional joint probability is in general impossible, due to the curse of dimensionality. Recent work has proposed to recover the joint probability mass function (PMF) of an arbitrary number of RVs from three-dimensional marginals, leveraging the algebraic properties of low-rank tensor decomposition and the (unknown) dependence among the RVs. Nonetheless, accurately estimating three-dimensional marginals can still be costly in terms of sample complexity, affecting the performance of this line of work in practice in the sample-starved regime. Using three-dimensional marginals also involves challenging tensor decomposition problems whose tractability is unclear. This work puts forth a new framework for learning the joint PMF using only pairwise marginals, which naturally enjoys a lower sample complexity relative to the third-order ones. A coupled nonnegative matrix factorization (CNMF) framework is developed, and its joint PMF recovery guarantees under various conditions are analyzed. Our method also features a Gram-Schmidt (GS)-like algorithm that exhibits competitive runtime performance. The algorithm is shown to provably recover the joint PMF up to bounded error in finite iterations, under reasonable conditions. It is also shown that a recently proposed economical expectation maximization (EM) algorithm guarantees to improve upon the GS-like algorithm's output, thereby further lifting up the accuracy and efficiency. Real-data experiments are employed to showcase the effectiveness.


Shallow Neural Network can Perfectly Classify an Object following Separable Probability Distribution

arXiv.org Machine Learning

Guiding the design of neural networks is of great importance to save enormous resources consumed on empirical decisions of architectural parameters. This paper constructs shallow sigmoid-type neural networks that achieve 100% accuracy in classification for datasets following a linear separability condition. The separability condition in this work is more relaxed than the widely used linear separability. Moreover, the constructed neural network guarantees perfect classification for any datasets sampled from a separable probability distribution. This generalization capability comes from the saturation of sigmoid function that exploits small margins near the boundaries of intervals formed by the separable probability distribution.


Multi-User Multi-Armed Bandits for Uncoordinated Spectrum Access

arXiv.org Machine Learning

The existing spectrum management paradigm treats frequency spectrum as a fixed commodity, which leads to spectrum under utilization. Cognitive radio has emerged as a useful strategy to increase spectrum utilization. The existing literature on cognitive radio has largely been focused on the primary/secondary user paradigm, where secondary users need to detect vacant spectrum when available and vacate the occupied spectrum when a primary user wants to transmit. We focus on a different type of spectrum sharing system in which there is no distinction between users, and in which there is no coordination among the users. The collective performance across all users is more important than that of individual users. This is in contrast to the typical primary/secondary user paradigm in which secondary users bear the responsibility for ensuring priority-based spectrum sharing.


Nonnegative Matrix Factorization for Signal and Data Analytics: Identifiability, Algorithms, and Applications

arXiv.org Machine Learning

Nonnegative matrix factorization (NMF) has become a workhorse for signal and data analytics, triggered by its model parsimony and interpretability. Perhaps a bit surprisingly, the understanding to its model identifiability---the major reason behind the interpretability in many applications such as topic mining and hyperspectral imaging---had been rather limited until recent years. Beginning from the 2010s, the identifiability research of NMF has progressed considerably: Many interesting and important results have been discovered by the signal processing (SP) and machine learning (ML) communities. NMF identifiability has a great impact on many aspects in practice, such as ill-posed formulation avoidance and performance-guaranteed algorithm design. On the other hand, there is no tutorial paper that introduces NMF from an identifiability viewpoint. In this paper, we aim at filling this gap by offering a comprehensive and deep tutorial on model identifiability of NMF as well as the connections to algorithms and applications. This tutorial will help researchers and graduate students grasp the essence and insights of NMF, thereby avoiding typical `pitfalls' that are often times due to unidentifiable NMF formulations. This paper will also help practitioners pick/design suitable factorization tools for their own problems.


Mixture Proportion Estimation via Kernel Embedding of Distributions

arXiv.org Machine Learning

Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture, given samples from the mixture and component. This problem constitutes a key part in many "weakly supervised learning" problems like learning with positive and unlabelled samples, learning with label noise, anomaly detection and crowdsourcing. While there have been several methods proposed to solve this problem, to the best of our knowledge no efficient algorithm with a proven convergence rate towards the true proportion exists for this problem. We fill this gap by constructing a provably correct algorithm for MPE, and derive convergence rates under certain assumptions on the distribution. Our method is based on embedding distributions onto an RKHS, and implementing it only requires solving a simple convex quadratic programming problem a few times. We run our algorithm on several standard classification datasets, and demonstrate that it performs comparably to or better than other algorithms on most datasets.